Single-step genomic BLUP with genetic groups and automatic adjustment for allele coding

Background Genomic estimated breeding values (GEBV) by single-step genomic BLUP (ssGBLUP) are affected by the centering of marker information used. The use of a fixed effect called J factor will lead to GEBV that are unaffected by the centering used. We extended the use of a single J factor to a group of J factors. Results J factor(s) are usually included in mixed model equations (MME) as regression effects but a transformation similar to that regularly used for genetic groups can be applied to obtain a simpler MME, which is sparser than the original MME and does not need computation of the J factors. When the J factor is based on the same structure as the genetic groups, then MME can be transformed such that coefficients for the genetic groups no longer include information from the genomic relationship matrix. We illustrate the use of J factors in the analysis of a Red dairy cattle data set for fertility. Conclusions The GEBV from these analyses confirmed the theoretical derivations that show that the resulting GEBV are allele coding independent when a J factor is used. Transformed MME led to faster computing time than the original regression-based MME.


Background
Single-step genomic BLUP (ssGBLUP) [1,2] requires that the pedigree and genomic relationship matrices are compatible [3]. Two measures of similarity have been considered [4]: averages of diagonal and all elements. These two statistics are affected by the completeness of pedigree information. In pedigree-based animal model evaluations, incomplete pedigree information is often modeled by genetic groups [5]. Elements of the genomic relationship matrix are typically computed using centered and scaled marker genotypes [6]. Both centering and scaling often depend on allele frequencies and are affected by the available animal genotypes and, when pedigree information is used in the allele frequency estimation, by the completeness of the pedigree. Thus, incomplete information can affect both the pedigree and the genomic relationship matrix.
Fernando et al. [7] proposed a marker-based singlestep model using Bayesian regression. When all the variance components are known, this model, hereafter called ssSNPBLUP, is equivalent to ssGBLUP. In their ssSN-PBLUP, the genomic estimated breeding values (GEBV) are made independent of the allele frequencies that are used for centering marker genotypes by a regression effect, hereafter called J factor, which adjusts the breeding values to the appropriate level [8]. This is similar to a simple genomic model without pedigree information, often called SNP-BLUP, where the marker effect solutions are independent of allele coding but, for the GEBV to be independent of allele coding, their level needs to be adjusted by a general mean [9]. Thus, in both ssGBLUP/ ssSNPBLUP and SNP-BLUP, estimating a fixed effect and adding its solution to the estimated genetic values allows the calculation of GEBV that are independent of the allele coding or centering of the genotypes used. Fitting a J factor in a single-step model has improved prediction accuracy when selection acted on the evaluated trait [8]. Furthermore, the use of a J factor has been observed to increase accuracy and lower bias in the analysis of simulated data [10].
In practice, the pedigrees are incomplete and animals with information descend from different base populations. For the pedigree-based animal models, Thompson [11] suggested the use of parent genetic groups to account for differences in genetic levels of the base populations. The genetic groups were modeled by regression coefficients. The resulting EBV were functions of genetic group solutions and additive genetic effects, similar to the J factor being part of the breeding value. Quaas and Pollak [12] showed that the so-called QP transformation can be used to model the genetic groups as unknown parent groups (UPG) in mixed model equations (MME). The use of the QP transformation allows a computationally efficient approach to include the regression effects of genetic groups in MME by augmenting the UPG into the inverse relationship matrix. Furthermore, the breeding values from MME by the QP transformation include the effect of genetic group information and, hence, there is no need to add the group effect solutions afterward to the estimated genetic effects. Misztal et al. [13] noted the computational difficulties of full QP transformation in ssGBLUP MME and discussed alternative approaches to fit genetic groups. Matilainen et al. [14] implemented the full QP transformation in multiple trait ssGBLUP of national dairy cattle fertility data with 11 traits. They observed that the full QP transformation guaranteed good convergence of the iterative method when solving the MME.
In this study, we use the J factor in the original ssGB-LUP model and extend the J factor approach to include the same structure as for the genetic groups. We derive simple MME by applying a QP-like transformation to the J factor and consider computational aspects of genomic relationship matrices in the transformed MME. We illustrate the effects of including the genetic groups and extended J factors on ssGBLUP using a Nordic Red dairy cattle fertility data set.

Single-step GBLUP model with genetic groups and J factors
We consider a single-trait single-step GBLUP (ssGBLUP) model: where b is a vector of fixed effects, c is an s by 1 vector of fixed genetic centering, i.e., J factor, regression effects [7], J is a q by s matrix of known coefficients, g is an r by 1 vector of random genetic group regression effects, Q is a q by r matrix of known coefficients, a is a q by 1 vector of random additive genetic effects, and e is a random residual vector. Matrix X relates fixed effects b and matrix W relates effects of centering Jc , genetic groups Qg and additive genetics a to appropriate observations in vector y . Matrix J has coefficients of genetic proportions in the s centering groups for the genotyped animals but imputed proportions for the non-genotyped animals. This matrix will be described below. The estimated fixed effects c allow to compute GEBV that will be unaffected by the centering of marker genotypes used when building the genomic relationship matrix, i.e., the GEBV will be free from the used allele coding. We assume Var(a) = Hσ 2 a and Var(e) = R . In the following derivations, we assume that the genetic groups are random with an expectation of zero and variance S . When fixed genetic groups are assumed, the resulting MME (below Eqs. (2-6)) contain neither S nor S −1 .
Matrix H −1 in the MME of ssGBLUP is according to [1,2]: where A is the full pedigree relationship matrix, G is the genomic relationship matrix, and A 22 is the pedigreebased relationship matrix of the genotyped animals. The genomic relationship matrix can be formed, for example, as G = ZD −1 Z ′ , where Z = M − P is a (centered) marker matrix of size n by m and D is a diagonal scaling matrix [6]. Each genotype value in the marker genotype matrix M is the number of alleles, with a value of 0 when the individual is homozygous for the first allele, 1 when the individual is heterozygous, and 2 when the individual is homozygous for the second allele. Matrix D is a diagonal scaling matrix. For example, the so-called VanRaden method 1 has D = kI , where k = m l=1 2p l (1 − p l ) and p l is the (base) population allele frequency for marker l . Here, we assume the ZD −1 Z ′ matrix to be non-singular but the following derivations allow more general definitions of the G matrix, and we will consider them later.
Values in the centering matrix P = 1v ′ often depend on the allele frequencies of the markers. For example, v = 2p ′ where p is an m by 1 vector of base population allele frequencies [9]. Fernando et al. [7] proposed to include a fixed regression effect in ssSNPBLUP such that (1) y = Xb + WJc + WQg + Wa + e,  12 is the pedigreebased relationship matrix between the non-genotyped (subscript 1) and genotyped (subscript 2) animals. A random J factor approach was presented for ssGBLUP in Vitezica et al. [3] and will be considered in the "Discussion" Section.
The ssSNPBLUP model by Fernando et al. [7] is a model equivalent to the ssGBLUP Model Eq. (1). Thus, following Fernando et al. [7], GEBV in Model Eq. (1) are computed as a d = J c + Q g + a , i.e., the J factor and the genetic groups are added to the additive genetic effects. GEBV a d are independent of the centering of marker genotypes used, i.e., allele coding, due to the presence of the fixed J factor solutions J c . In ssSNPBLUP, the marker genotypes are used as regression coefficients where the marker genotypes for the non-genotyped animals are imputed from the genotyped animals using the linear imputation formula in the imputation formula are used in the fixed J factor to "impute" the general mean from the genotyped animals to the non-genotyped animals. Consequently, any changes in the centering of the genotypes will change the additive genetic effect estimates a but changes due to the J factor estimates J c allow the GEBV a d to remain unchanged. This is like any linear model that has a fixed general mean, a linear shift in the regression coefficients will change the general mean estimate but lead to the same predicted observations as shown for SNP-BLUP in [9]. The independence of allele coding can be proved formally by generalizing the derivations for SNP-BLUP in [9]. The allele coding independence will also be realized in ssGBLUP, because ssSNPBLUP and ssGBLUP are equivalent.
We generalize the fixed J factor approach from a single regression effect to s regression effects that may depend on the pedigree structure or predefined group status such as birth year or breed. Let the coefficient matrix J of the regression effect c be minus one times matrix Q c for the genotyped animals and −A 12 A −1 22 Q c for the non-genotyped animals: J = −A 12 A −1

22
−I Q c where Q c is an n g by s matrix having coefficients for the genotyped animals in the J factor groups and n g is the number genotyped animals. We assume that the sums of the rows of the Q c matrix equal 1, i.e., Q c 1 = 1 , and every element in Q c is within the interval [0,1]. The generalization from a single to multiple J factors makes the need to account for differences in centering the genotypes between genotyped individuals simple. Explicit centering of the genotype matrix M using the Q c matrix, i.e., Z = M − 2Q c P c ′ , where P c is an m by s matrix that has allele frequencies in the s groups for the m markers, becomes void using the multiple group J factor by the Q c matrix and follows from generalizing the development of Fernando et al. [7] and Strandén and Christensen [9]. For example, when the Q c matrix has breed proportions, the use of breed-wise allele frequencies for centering in the genomic relationship matrix [15] will give the same GEBV as those that use an allele frequency of 0.5 for all markers provided the same scaling is used.
Rows in the Q c coefficients matrix can have fractions of the base group proportions for the genotyped animal, which are calculated using pedigree information similarly to the coefficients in the Q 2 matrix for the genotyped animals in the Q matrix for the unknown genetic groups. The J factor effects become confounded with the genetic group effects when Q c equals Q 2 , and all phenotyped animals have been genotyped. When the number of phenotyped animals without genotype information is small, there may be a situation close to collinearity with the genetic group and J factor effects since these effects will try to model the same effect. This is unlikely in many current breeding populations with long recording history and with many phenotyped animals without genotype information. However, some new traits such as greenhouse gas emission measurements have been recorded only recently and are likely to be from genotyped animals only. In the case when almost all the phenotyped animals have been genotyped, the J factor effect could be treated as operationally random. Otherwise, the J factor would be inseparable from the overall mean and the results may be meaningless. However, the Q c and Q 2 matrices do not need to be the same. For example, the Q 2 matrix can have genetic groups based on breed, birth year, country of origin, and sex but the Q c matrix can have fewer classes due to a pedigree that traces back far with distinct sub-populations, which can lead to the J factor coefficients in the A 12 A −1 22 Q 2 matrix for some genetic groups to be zero or close to zero. In the extreme, when Q c equals 1 , our generalization reduces to the J factor in Fernando et al. [7]. where the left-hand side has the breeding value estimates a d calculated as linear function of the J factor, genetic group and genetic effect solutions. Let C and r be the coefficient matrix and the right-hand side vector in MME Eq. (2), respectively. In the QP transformation, the MME are transformed to be P −1 ′ CP −1 v d = P −1 ′ r . MME of the QP transformed ssGBLUP are: The term H −1 J in the MME Eq. (3) can be simplified. First, note that: Thus, the MME Eq. (3) can be written as: where F = 0 −G −1 and Q 2 are the rows of matrix Q pertaining to the genotyped animals. Thus, the coefficients to the regression effect c involve only functions of Q c and G −1 , and no longer neither matrix J as in the MME Eqs. (2) and (3), nor the pedigree-based relationship matrix as in the MME Eq. (3).
Assuming that Q c ′ G −1 Q c is non-singular, MME Eq. (4) can be further simplified by absorption of the c effect to the other effects.
i.e., the rows in the MME Eq. (4) coefficient matrix for the J factor effect c excluding columns having coefficients for c . This can be rewritten as has non-zero elements only at columns for the genetic groups ( Q 2 ) and breeding values of genotyped animals ( I ). Because the righthand side values in the MME Eq. (4) corresponding to c are zero, the absorption changes only the coefficient matrix. The change due to the absorption is Because matrix K Q operates only on the coefficients of the genotyped animals and the genetic groups through Q 2 , the MME Eq. (4) after absorption of the J factor effect is changed as: and G * J = G −1 + K c . Thus, the J factors can be accounted in MME by changing the G −1 matrix without having to solve explicitly the regression effects c and to calculate the J matrix.
The absorption of the J factor effect in MME Eqs. (4) and (5) is less than the number of groups, i.e., there are linearly dependent groups. Observe also that G * J Q c = 0 and that G * J GG * J = G * J . Thus, the G matrix is by definition a generalized inverse of G * J . Note that the G * J matrix in the MME Eq. (5) is a computational result from absorbing the J factor effect, not an inverse of a genomic relationship matrix. In particular, G * J is singular as can be easily proved by observing that application of the Woodbury formula to invert will require the inversion of a singular matrix, i.e., a matrix of zeros.

Special cases
An important special case in MME Eq. (5) is to have Q c = Q 2 , i.e., the same groups are used for centering and for the unknown genetic groups. Because now G * J Q 2 = G * J Q c = 0 , MME Eq. (5) can be written as: Note that in MME Eq. (6) the genomic relationship matrix G makes no contribution to the coefficients involving the genetic group effects g because matrices E and Q are not functions of the G matrix.
Another special case is the original J factor model in [7] with Q c = 1 , where c is a scalar valued regression effect. This will illustrate the MME in ssGBLUP when the original J factor of Fernando et al. [7] is used. Then, the absorption of the c effect in MME Eq. (4) gives MME Eq. (5) but with G * J = G −1 + K 1 and As before, G * J 1 = 0 , i.e., matrix G −1 has been replaced by G * J , where the rows and columns sum to zero. However, when Q 2 = 1 , the G * J Q 2 product can be different from zero. Thus, genomic data can influence coefficients of the genetic groups and the diagonal matrix for the genetic groups is Q ′ H * J Qσ −2 a + S −1 as in MME Eq. (5).

Computational considerations
In the derivations above, the genomic relationship matrix has the form G = ZD −1 Z ′ . When the number of genotyped animals is larger than the number of SNPs, this G matrix becomes singular. Common non-singular forms of the genomic relationship matrix are G C = G + C , where the regularization matrix C is non-singular, easy to invert and independent of genomic information [16,17]. Examples of such constrained genomic relationship matrices are G ε = G + εI and where ε is a small number and w is the so-called residual polygenic proportion. There is always an equivalent single-step SNP-BLUP type of model for these ssGBLUP models [18]. Although the regularization matrix is not needed to avoid the singularity problem in single-step SNP-BLUP, a counterpart to the regularization matrix C is an independent random effect having a covariance matrix C . In case C = wA 22 , the genotyped animals have genomic and pedigree information weighted by a residual polygenic proportion, and in case C = εI , an independent and identically distributed random effect is added to each genotyped animal. Thus, the earlier derivations for the J factor are valid and allow to account for any differences due to centering in G by the different allele coding by ssGBLUP models using G ε or G w . However, note that differences in scaling, i.e., the D matrix, can lead to differences in GEBV. We will illustrate computational performance and consequences of the derivations using a G w matrix in analysis of a small data set.

Data
Approaches were tested using a subset of dairy cattle fertility data from Nordic Cattle Genetic Evaluation (NAV, Aarhus, Denmark). The data set is described in Matilainen et al. [14]. We considered only the two heifer fertility traits, i.e., nonreturn rate within 56 days after first service (NRR0) and days from first to last insemination (IFL0). The numbers of NRR and IFL0 observations were 6.5 million and 6.2 million, respectively. The pedigree included 5.4 million animals of which 33,969 were genotyped. There were 332 genetic groups which accounted for genetic level by breed, country of origin, and birth year. In the computations, we calculate the G * J matrix and solve MME Eq. (5) for different J factor models, where one of them has Q c equal to Q 2 . Computation of G * J requires that the Q c ′ G −1 Q c matrix is not singular. In other words, the genetic group matrix Q c cannot have linearly dependent rows/columns. The original groups defined and used in Matilainen et al. [14] led to a singular Q c ′ G −1 Q c matrix. Thus, for our study, we combined several adjacent birth year groups and reduced the number of groups from 332 to 232. The bulls were genotyped using the Illumina BovineSNP50 chip and the cows were genotyped using the BovineLD Bead Chip with the genotypes imputed to the 50K chip (Illumina Inc., San Diego, CA, USA).

Study design
The residual polygenic proportion w was 20% in the genomic relationship matrix G w = (1 − w)G + wA 22 , where G = ZD −1 Z ′ was as in VanRaden's method 1 [6].
Two sets of allele frequencies to center the Z matrix were tested. In the first approach, an allele frequency of 0.5 was used for all markers ( p = 0.5 ). The second approach used base population allele frequencies which were estimated using the generalized least square (GLS) approach assuming a single breed [19] as implemented in the Bpop program [20]. The first approach is denoted by 101 and G 101 matrix, and the second approach by PvR1 and G PvR1 . Both G matrices used the same scaling factor k = m/2 where m is equal to the number of markers. The same scaling factor k allowed a scale independent comparison of the centering approaches. The data were analyzed using three ssGBLUP models which had the same genetic groups. Two of the models had J factors, either as a single J factor (J1) or genetic group-based J factors (JQ), i.e., Q c = 1 or Q c = Q 2 , respectively. In the MME, the J factors and the genetic groups were either regression coefficients (reg) or pedigree groups after QP transformation as described earlier. Thus, we performed six ssGBLUP model analyses (Table 1). These models are referenced by the names QP, QPJ1, QPJQ, reg, regJ1, and regJQ. The term J1 will refer to both QPJ1 and regJ1, and JQ will refer to both QPJQ and regJQ. The reg model solved MME (2) and the QP model solved MME (5) with or without a J factor. The computational performance of the ssGBLUP approaches was measured by the number of iterations until convergence and computing time per iteration.

Computations
The MiX99 software was used to solve MME to calculate GEBV using iteration on data and the PCG method [20] with a block diagonal preconditioner. The computations accounted for the inbreeding coefficients in A −1 and A 22 . The PCG method was assumed to be converged when C r < 10 −7 where C r is defined as the Euclidean norm of the difference between the right-hand side (RHS) of the MME and the one predicted by the current solutions relative to norm of the RHS: where s [k] 1 is the solution vector at round k , C is the MME coefficient matrix, and r is the MME right-hand side vector.
In the reg models, the regression coefficient matrices WQ for the genetic groups and WJ for the J factors were precomputed and read from disk. The Q matrix was calculated based on pedigree information and this computation was fast (17s) using the RelaX2 program [22]. Two implementations for the WQ matrix in solving MME Eq. (2) were tested. In the first approach, the WQ matrix was considered as a dense matrix, and in the second approach, it was read to memory as a sparse matrix.  where A ij is the sub-matrix ij of A −1 , = .001/.999 . Thus, the equations need to be solved for every J factor group. The equations were solved by MiX99 using the convergence limit C r < 10 −9 . The ratio corresponds to the ratio of residual and genetic variances in an animal breeding MME. We used a small value, which corresponds to having a high heritability. Consequently, the solutions j * 2 will be close to the right-hand side −v . For safety's sake, the precomputed values in −v were used as J factors for the genotyped animals instead of the j * 2 vector values.

A 11
A 12

Results
Correlations of GEBV between the corresponding regression and QP transformation models for the genotyped and for all the animals were 100.00% for QP and QPJ1, but 99.99% for QPJQ, and the linear regression of GEBV by the QP model on the reg model had an intercept of 0 and a slope of 1. Thus, the regression and QP models resulted in the same GEBV. Correlations between GEBV solutions from different allele coding approaches were 100.00% between the J1 factor models and between the JQ models. Correlations of GEBV between the QP/reg and the J1 models were high, 100.00% for all animals and 99.99% for the genotyped animals. However, the JQ models gave GEBV that were distinctly different to those of the other models, the correlations ranged from 98.78 to 98.96% for all animals and from 83.95 to 85.74% for the genotyped animals. In other words, the use of either allele coding ( G 101 or G PvR1 ) did not affect GEBV results when a J factor was included in the model, and the full QP and reg models gave the same GEBV, as expected.
In the JQ models, using either the 101 or PvR1 allele coding, the GEBV were the same. Likewise, the G −1 J matrix was the same irrespective of allele coding. However, the G −1 matrices were different by allele coding. The change needed in the G −1 matrix by allele coding to arrive to the G −1 J matrix is measured by the K Q matrix. Figure 1 illustrates differences in the elements of the K Q matrix values between the two allele coding approaches. Values close to 0 on the x-axis mean that the elements have not changed much from the G −1 101 matrix. The change from G −1 101 to G −1 J was mostly between -1 and 1 with an average of 0. The y-axis shows differences in the K Q matrix elements of PvR1 minus 101 allele coding, i.e., elements of K Q,PvR1 − K Q,101 . Values on the y-axis are mostly lower than 0. Thus, the G −1 matrix calculated by using base population allele frequencies had to be  Table 2 Number of iterations until convergence in single-step GBLUP when the centering of the markers for the genomic relationship matrix used base population allele frequencies (before the "/" sign) or an allele frequency of 0.5 for all markers (after the "/" sign) Groups: the model had genetic groups; Groups + J1: the model had genetic groups and a single J factor; Groups + JQ: the model had genetic groups and group-wise J factors; reg, dense: regression coefficients for genetic groups and J factors in a dense matrix; reg, sparse: regression coefficients for genetic groups in a sparse matrix and J factors in a dense matrix; QP: genetic groups and J factors by QP transformation changed more than the G −1 101 matrix in order to arrive to the same G −1 J matrix. The average absolute diagonal and off-diagonal elements of the K Q,PvR1 ( K Q,101 ) matrix were 8.99 × 10 -4 (8.97 × 10 -4 ) and 4.47 × 10 -2 (4.42 × 10 -2 ), respectively, with standard deviations of 2.58 × 10 -3 (2.50 × 10 -3 ) and 0.234 (0.229), respectively. One would have expected the base population-based genomic relationship matrix G PvR1 to show smaller change than the G 101 matrix. A reason for G −1 PvR1 to show slightly larger changes than G −1

101
can be due to the use of an incorrect scaling factor in G PvR1 to allow the JQ models to reach the same GEBV. The used scaling factor m/2 is larger than the more correct m l=1 2p l (1 − p l ) , which would lead to K Q,PvR1 that equals multiplying the current K Q,PvR1 matrix by m l=1 2p l (1 − p l ) 2 m when there is no RPG component. This multiplier is 0.69 using marker information from our data.
The number of iterations until convergence varied from 1531 to 2979 ( Table 2). Note that the results are based on fertility data and a two-trait model for complex low heritable traits. The number of iterations varied more with the regression-based models than with the QP models. The reason is that the convergence criterion showed larger round-to-round changes in the regression models than in the QP models which led the convergence statistic to be reached more sporadically (Fig. 2). The larger variation in the convergence statistic with the regression models suggests that they could benefit from a better preconditioner. In all the analyses, the preconditioner was a block diagonal matrix with two-by-two trait blocks within each level of each effect. Apparently, compared to the QP models the reg models had larger off-diagonal values relative to the diagonal in the MME. Nevertheless, the regression model with J1 showed the smallest number of iterations until convergence.
Computing times per iteration for solving GEBV were shorter for the QP models than for the regression models ( Table 3). The WQ matrix was sparse with 6% of its elements being non-zero while the WJ matrix was dense with 95% of its elements being non-zero. When sparse matrix computations were used for the WQ matrix, the regression models were almost as fast as the QP model except for JQ because of the dense WJ matrix computations.

Discussion
We used data on dairy cattle fertility and a two-trait model to illustrate the computational performance of the equivalent MME Eqs. (2) and (5). The observed differences in computing times per iteration (Table 3) are due to the number of multiplications in the MME coefficient matrix times a vector product that is needed in the PCG iteration. Differences in the numbers of multiplications per iteration for the QP and reg models in the computation of the MME coefficient matrix times a vector can be estimated. In MME Eq. (5) of the QP model, the difference in the number of multiplications is mostly due to the genetic groups related to the coefficient matrices Q 2 ′ G −1 Q 2 , −Q 2 ′ G −1 and −G −1 Q 2 which were precomputed in our study. The precomputation allows a computationally simple implementation of the solver program where these precomputed matrices can be included into the G −1 matrix file and used with the same pedigree groups in many evaluations without the need for the solver to compute them for each evaluation. In the PCG iteration, the number of multiplications in the product of these matrices times a vector is  Table 3 Computing time (seconds) per iteration to solve singlestep GBLUP with base population allele frequencies in the genomic relationship matrix Computing times in parentheses use an allele frequency of 0.5 for all markers Groups: the model had genetic groups; Groups + J1: the model had genetic groups and a single J factor; Groups + JQ: the model had genetic groups and group-wise J factors; reg, dense: regression coefficients for genetic groups and J factors in a dense matrix; reg, sparse: regression coefficients for genetic groups in a sparse matrix and J factors in a dense matrix; QP: genetic groups and J factors by QP transformation r r + 2n g , where r is the number of genetic groups and n g is the number genotyped animals. In MME Eq. (2) of the reg model, the difference in the numbers of multiplications is due to the regression coefficient matrices WJ and WQ , which are not present in the QP model. In order to estimate the number of multiplications, note that the implemented PCG iteration used computation by parts in iteration on data as described in Strandén and Lidauer [21]. For the genetic groups in the reg model, every PCG iteration required the product Q ′ W ′ R −1 d y , where d y = Xd b + WJd c + WQd g + Wd a . The terms with WJ and WQ are not included in the QP model. Let us assume that the data file has the rows in the J and Q matrices corresponding to the observations, i.e., n y by s matrix J W = WJ and n y by r matrix Q W = WQ . Then, in the calculation of d y , the J factors ( J W d c ) and the genetic groups ( Q W d g ) require n y s and n y r multiplications, respectively, where n y is the number of observations. The multiplication Q W ′ R −1 d y requires n y r multiplications when the multiplications by the R −1 matrix are ignored. Thus, in total n y (s + 2r) multiplications are required. When J factor computations are not present in the reg model, i.e., s is equal to zero, the number of multiplications in the reg model ( = 2n y r ) is larger than in the QP model ( = r r + 2n g ) when n y > n g + 1 2 r . Thus, in practice, the number of multiplications per PCG iteration in a ssGBLUP model with QP is often smaller than in the corresponding reg model.
A sparse WQ matrix allows to decrease the reg model solver computing time. Let us consider the differences in the number of multiplications in PCG between the QP and reg models. When there are no J factors and the WQ matrix has a sparsity of p , the reg model has 2n y rp multiplications not included in the QP model. Thus, the QP model has less multiplications than the reg model when n y > 1 p n g + 1 2p r . For example, assuming 5% of non-zeros in the WQ matrix would have n y > 20n g + 10r , i.e., when the ratio between the number of genotyped and phenotyped animals is higher than the density of nonzeros in the Q matrix, the QP model has more multiplications than the reg model. However, in practice, the difference in computing time can be small when the number of genotyped animals is large. In this case, most of the computing time is due to the genomic relationship matrix.
The QP model has an added computational preprocessing cost due to the calculation of Q 2 ′ G −1 Q 2 , −Q 2 ′ G −1 and −G −1 Q 2 . The number of multiplications to calculate these matrices is rn g r + n g when we note that the computational result from the two latter matrices ( G −1 Q 2 ) is an n g by r matrix and can be used in the computation of the first matrix. This computational cost is small, because inversion of the G matrix is much more demanding since there are typically more genotyped animals than groups. Furthermore, these matrices are calculated only once but the numbers of multiplications given in the previous paragraphs are computed for each PCG iteration. Both of our genomic data sets were so small that we did not see any practical difference in computing time due to QP when making the augmented G −1 matrix having The same was true when making the J1 adjustment to G −1 . However, making the JQ adjustment to G −1 doubled the computing time. This increase in computing time was not significant compared to the total computing time.
Previous studies have recommended making the genomic relationship matrix compatible with the pedigree-based relationship matrix [3,13]. The use of a J factor allows the calculation of allele-coding-free GEBV. Hence, the compatibility in a ssGBLUP model with a J factor means compatibility in scaling the marker matrix, which was the same in all our G matrices. Thus, while the J factor removes the necessity to center the marker matrix, proper scaling is still required. When centering and scaling use base population allele frequencies, the recommended scaling factor for a single breed in [6] is m l=1 2p l (1 − p l ) instead of m/2 as used in our study. The use of a J factor will give GEBV that differ from those based on a G matrix where the base population allele frequencies have been estimated using the GLS approach as in this study. There is some evidence that a J factor can have a positive impact on the accuracy of prediction [8]. Correlations of GEBV for the genotyped animals between the JQ and J1 models were only about 85%, suggesting a notable difference in prediction ability. However, the accuracy of prediction in a multiple J factor model has not yet been studied. Thus, further work is necessary to assess the effect of a J factor on the predictability of GEBV but also the theoretical consequences of its use.
We used J factors to be able to calculate allele-codingfree GEBV, which parallels the work in [9] for the GBLUP and SNP-BLUP models. As in their study, the allele-coding-free GEBV calculated by the J1 and JQ models do not allow the computation of individual animal-based reliabilities that are allele-coding-free because the J factor effect cannot be included into the individual animal genetic variance term that is used as a denominator in the reliability equation. Diagonals of the G matrix from the J1 or JQ model depend on allele coding, likewise, the H matrix depends on allele coding even in the J1 and JQ models. Thus, computation of individual animal reliabilities of GEBV for the ssGBLUP model is relative to allele coding as in GBLUP and SNP-BLUP models even if allele-coding-free GEBV can be computed. When the J factor is considered random, then we can include the J factor into the G matrix and have well-defined genetic K c term can be negligible or give even better predictability than the QP transformation model without a J factor.
The presented QP transformation for the J factor and consequent absorption yielding MME Eq. (5) is computationally simple for the single-step type of models where the inverse of the genomic relationship matrix G −1 is explicitly computed. When the number of genotyped animals is large, the G matrix will take too much memory and the G −1 J matrix can no longer be calculated. A memory efficient single-step model alternative is ssGTBLUP [16], where the genomic relationship matrix is assumed to have the form G = ZD −1 Z ′ + C and its inverse is G −1 = C −1 − T ′ T with the rectangular matrix T of size equal to number of SNPs times number of genotyped animals. The absorption term K c in MME Eq. (5) can be implemented for ssGTBLUP. Note that when G −1 = C −1 − T ′ T as in [16]: where T K = L −1 Q c ′ C −1 − T ′ T is an s by n g matrix and L is an s by s matrix from Cholesky decomposition of Q c ′ C −1 − T ′ T Q c . Thus, G * J = C −1 − T ′ T − T K ′ T K , which allows an easy and efficient PCG implementation. An alternative approach is to use MME Eq. (4). Then, as described for genetic groups in Koivula et al. [29], the T matrix can be augmented with TQ c in order to have the c effect. However, some additional computations are needed because the T matrix does not contain the computations due to the C −1 matrix of G −1 . The required computations due to the terms Q c ′ C −1 Q c and C −1 Q c can be done by using precomputed matrices.

Conclusions
The use of a J factor effect allows to compute GEBV in ssGBLUP and ssSNPBLUP that are independent of the allele coding used to center the marker matrix. We extended the single J factor regression to multiple groupbased J factor regression effects. We used transformation in the MME of the ssGBLUP model to change the regression effect-based J factors to be correlated with genetic effects only. This showed a conceptual similarity of the J factors with the genetic groups which after a similar transformation can be used to augment the relationship matrix information. Furthermore, the transformation gave MME where the J factor coefficients do not need to be computed. When the number of J factor groups is large, solving the regression effect-based J factor MME can be computationally much more demanding than the transformed MME. Using the same regression coefficients for the J factor coefficients of the genotyped animals as for the genetic groups, we showed that the transformed MME in ssGBLUP no longer required the genomic relationship matrix to be accounted for in the genetic group equations when the J factor effects had been absorbed. We tested different J factor models for the analysis of a dairy fertility data set. We observed that GEBV were the same within a J factor model regardless of the allele coding approach as predicted by our derivations and that the QP transformed MME were computationally more efficient than the original regression-based MME. Further work is needed to assess predictability and proper individual reliability of GEBV when the model has a J factor.